When quantum tomography goes wrong: drift of quantum sources and other errors 
The principle behind quantum tomography is that a large set of observations - many samples
from a "quorum" of distinct observables - can all be explained satisfactorily as measurements on a 
single underlying quantum state or process. Unfortunately, this principle may not hold. When it 
fails, any standard tomographic estimate should be viewed skeptically. Here we propose a simple 
way to test for this kind of failure using Akaike's Information Criterion (AIC). We point out that 
the application of this criterion in a quantum context, while still powerful, is not as straightforward 
as it is in classical physics. This is especially the case when future observables differ from those 
constituting the quorum. 



I. INTRODUCTION 

A. General remarks 

The goal of quantum-state tomography is to give a
statistically reliable estimate of a quantum state $\rho$. Two
further questions may come to mind: (i) what is the purpose
of that estimate $\rho$? And (ii), why or when are we
correct in giving an estimate of just one quantum state?

There are at least two answers to the first question: 
our experiment may be aimed at producing a particular 
state, say, a cluster state, and we may just want to verify 
how close $\rho$ is to the desired state. But that answer
provides really just an intermediate goal. The ultimate 
goal is always to use the desired state for some particular 
quantum information processing task. So we could say 
that the goal of producing an estimate $\rho$ is to be able to
predict the future performance in a particular protocol 
of one or more unmeasured quantum system(s) produced 
by the same source. 

Now there is a nice statistical method for ranking dif- 
ferent models according to their ability to predict future 
measurement results (not on how well they fit the past
data!), based on the Akaike Information Criterion (AIC)
[3]. That criterion was developed entirely within a classical
context, but it ought to apply to quantum-state estimation,
too. We show this is true, even though we will
point out some interesting differences between classical 
and quantum statistics. 

The motivation behind the second question is as fol- 
lows. Since we do not have full control over all physical 
quantities relevant to the quantum-state generation pro- 
cess (for example, even the best laser suffers from phase 
diffusion; and there are always spatially and temporally 
fluctuating magnetic and electric fields), the quantum 
states produced by a quantum source are not all identical. 
A possible description of the individual states of $M$ systems
$k = 1 \ldots M$ would be a sequence $\{\rho_k,\ k = 1 \ldots M\}$,
where each $\rho_{k+1}$ is a little different from the previous
one (even with entanglement or correlation between the
different systems, we can define $\rho_k$ by tracing out all
the other systems). So, why would we use just a single
estimate $\rho$ in this case? One aspect of the answer
is, of course, that we have no way of estimating each
individual $\rho_k$. A more positive answer is that multiple
measurements of a given observable $O$ only yield estimates
of average quantities such as $\langle O \rangle = \overline{{\rm Tr}\,\rho_k O}$ or
$p_n = \overline{{\rm Tr}\,\rho_k |O_n\rangle\langle O_n|}$, where the average is over those $k$
on which $O$ was measured, and where $|O_n\rangle$ denotes an
eigenstate of $O$. These averages, being linear in $\rho_k$, are
determined by a single density matrix, namely the average
density matrix $\bar\rho = \overline{\rho_k}$. This simple picture has been
made much more rigorous by Renner in [3]. He showed 
that the crucial ingredient (missing in the simple picture) 
is permutation invariance. That is, if we randomly per- 
mute the sequence of quantum systems, and then trace 
out some subset, the joint state of the remaining systems 
is to a good approximation independently and identically 
distributed (i.i.d.). In our context this means that as long 
as the quorum of observables is measured in a random 
order, then to a good approximation any one of the re- 
maining unmeasured systems can be described by a single 
density matrix $\rho$. We now discuss what may go wrong if
we measure the observables constituting a quorum in a 
nonrandom order. 



B. Possible errors in standard quantum state 
tomography 

It is much easier to measure a given observable from 
the quorum many times in a row, before switching to 
measurement of the next observable. Such a procedure 
is standard practice, but it voids Renner's proof, and 
so it may be that there is not a single density matrix 
that can be validly assigned to the remaining unmeasured 
quantum systems. 

Let us introduce this problem with a simple example. 
Given an ensemble of $3N \gg 1$ qubits that - we assume!
- are identically and independently prepared, we want to
estimate their density matrix. So we divide them into
three equal and sequential groups, and measure $\sigma_x$ on
samples $1 \ldots N$, $\sigma_y$ on samples $N+1 \ldots 2N$, and $\sigma_z$ on
the last $N$. Now, if the samples are indeed identically
prepared in some state $\rho$, then we can safely perform the
measurements in this order - the state $\rho^{\otimes 3N}$ is invariant
under permutations, so all orderings are equivalent. But
if the source is drifting over time, the first $N$ copies are
best described by a mean density matrix $\rho_1$, while the
second and third sets of $N$ qubits are best described by
(possibly different) average states $\rho_2$ and $\rho_3$, respectively.

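For an amusing (albeit extreme) example, consider a
situation where the first $N$ copies are best described by
$\rho_1 = |{+}\rangle\langle{+}|$ (the $+1$ eigenstate of $\sigma_x$), the second group by
$\rho_2 = |{+i}\rangle\langle{+i}|$ (the $+1$ eigenstate of $\sigma_y$), and the
third by $\rho_3 = |0\rangle\langle 0|$. The measurement outcomes in this
case are not random at all: every single measurement
(of $\sigma_x, \sigma_y, \sigma_z$) will yield eigenvalue $+1$. Linear inversion
tomography will yield a radically non-positive state

$$\rho_{\rm tomo} = \frac{1}{2}\left(\mathbb{1} + \sigma_x + \sigma_y + \sigma_z\right) = \begin{pmatrix} 1 & \frac{1-i}{2} \\ \frac{1+i}{2} & 0 \end{pmatrix}, \qquad (1)$$

and maximum likelihood estimation (MLE) yields the
projector onto $\rho_{\rm tomo}$'s positive eigenspace. Although both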
estimates are plausible answers to "What single matrix 
best fits the observed data?", neither one of them is of 
any predictive use at all! The source is drifting so rapidly 
and drastically that this set of 3N samples really tells us 
almost nothing about future observations. This is the 
simplest and best conclusion at which our data analysis 
should arrive. 
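The numbers in this extreme example are easy to check. The following is a minimal sketch, assuming only that linear inversion builds the estimate $(\mathbb{1} + X\sigma_x + Y\sigma_y + Z\sigma_z)/2$ from the three observed means; the function names are hypothetical and chosen for illustration. It uses the fact that such a matrix has eigenvalues $(1 \pm R)/2$ with $R$ the length of the estimated Bloch vector.

```python
import math

def linear_inversion_eigenvalues(x, y, z):
    """Eigenvalues of (1 + x*sx + y*sy + z*sz)/2 built from the observed means."""
    r = math.sqrt(x * x + y * y + z * z)  # length of the estimated Bloch vector
    return (1 + r) / 2, (1 - r) / 2

def positive_eigenspace_bloch_vector(x, y, z):
    """Bloch vector of the projector onto the positive eigenspace (needs r > 1)."""
    r = math.sqrt(x * x + y * y + z * z)
    return (x / r, y / r, z / r)

# Every outcome was +1, so all three observed means equal 1.
lam_plus, lam_minus = linear_inversion_eigenvalues(1.0, 1.0, 1.0)
print(lam_plus, lam_minus)                              # one eigenvalue is negative
print(positive_eigenspace_bloch_vector(1.0, 1.0, 1.0))  # a pure state
```

The negative eigenvalue, $(1-\sqrt{3})/2$, is what makes this estimate "radically non-positive".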

This is a rather extreme and contrived example of experimental
drift [below we will discuss a more common
type of nonrandom experiment where the above cycle of
measurements is repeated once: so we measure $\sigma_x$ on the
first $N/2$ copies, then $\sigma_y$, then $\sigma_z$, and then $\sigma_x, \sigma_y, \sigma_z$
again, each on $N/2$ sequential copies]. More realistic
examples show similar behavior, though. The statis- 
tics given above are actually more consistent with a dif- 
ferent (and still plausible) mechanism: When the mea- 
surement apparatus is "rotated" to perform a different 
measurement, the experimenter inadvertently "rotates" 
the samples as well. A particularly naive version of this 
could occur with photon polarization, where one way to 
physically rotate a polarizer is for the experimentalist to 
simply rotate his own frame of reference (e.g., by lying 
down). Such a passive rotation obviously fails to change 
the relative orientation of samples and apparatus. More 
realistic examples occur when similar quantum gate de- 
vices are used to (1) prepare states (e.g. EPR states) 
and (2) implement measurements. In quantum process 
tomography, this sort of pitfall is well known; it violates 
the conditions for complete positivity of processes, and 
causes negative eigenvalues just as in our example above.

All of these failures are examples of a single phenomenon:
sample-apparatus correlation. In process tomography,
this is usually explained by correlation between
the system and its environment. In state tomography,
there is no environment per se, but if the state of the
$k$th sample is (in any way) correlated with the behavior
of the measurement apparatus (e.g., with what measure- 
ment it is oriented to perform), then tomography goes 
wrong. Experimental drift is a simple and easy to under- 
stand example: the sample state is correlated with time, 
and if the apparatus setting is also allowed to vary with 
time, then there will be sample-apparatus correlation. 
As noted above, this can be eliminated by explicitly ran- 
domizing the order of measurements, so that while the 
samples are still time-dependent, the apparatus is not. 
Other kinds of sample-apparatus correlation are not so 
easy to remedy. 
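Randomizing the order of measurements is simple to do in practice. The sketch below is one possible way, not a prescription from this paper: it assigns each setting to the same number of samples and then shuffles the whole schedule, so that the apparatus setting is uncorrelated with time. The function name and its arguments are hypothetical.

```python
import random
from collections import Counter

def randomized_schedule(settings, n_per_setting, seed=None):
    """Return a shuffled list of settings, each appearing n_per_setting times."""
    rng = random.Random(seed)  # a seed makes the schedule reproducible
    schedule = [s for s in settings for _ in range(n_per_setting)]
    rng.shuffle(schedule)      # uniformly random permutation of the sequential plan
    return schedule

schedule = randomized_schedule(["x", "y", "z"], n_per_setting=4, seed=1)
print(schedule)
print(Counter(schedule))  # each setting is still measured equally often
```

The shuffle leaves the number of samples per observable unchanged; only their position in time is randomized.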

In the example given above, the extremity of the data - 
and the fact that the linear inversion estimate is radically 
negative - are a dead giveaway. On the other hand, linear 
inversion can produce negative estimates even with ideal 
data because of statistical fluctuations. The raison
d'etre of MLE is to fix this negativity, but by constraining 
the estimate to positive states, MLE also hides the tell- 
tale signature of failed tomography. Moreover, negative 
estimates are not (in general) a reliable symptom even 
of drastic experimental drift. If the drifting states in 
the example above were a bit more mixed - e.g. $\rho'_k = \frac{1}{2}\rho_k + \frac{1}{4}\mathbb{1}$
- then linear inversion and MLE would yield
identical and positive density matrices. But, just as in 
the original example, those estimates would be useless 
and not predictive. 
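A quick numerical check of this point, as a sketch under a stated assumption: each drifting state is taken to be a mixture $w\rho_k + (1-w)\mathbb{1}/2$ of the pure state from the extreme example with the maximally mixed state, and the weight $w = 1/2$ is an illustrative choice. The function name is hypothetical.

```python
import math

def estimate_from_mixed_drift(w):
    """Observed means and linear-inversion eigenvalues for blocks w*rho_k + (1-w)*I/2."""
    # Each block is measured along the axis its pure component points along,
    # so every observed mean equals the mixing weight w.
    x = y = z = w
    r = math.sqrt(x * x + y * y + z * z)
    eigenvalues = ((1 + r) / 2, (1 - r) / 2)
    return (x, y, z), eigenvalues

means, eigs = estimate_from_mixed_drift(0.5)
print(means, eigs)  # both eigenvalues are positive: no negativity to warn us
# If the source then stays in its third state, the true <sigma_x> is 0,
# whereas the single-matrix estimate predicts means[0].
```

The estimate is a legitimate density matrix, yet its prediction for a future $\sigma_x$ measurement would still be wrong if the source remained in its last state.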

Fortunately, there is a general solution to this problem. 
It elegantly generalizes the observation (made above) 
that a radically negative $\rho_{\rm tomo}$ should trigger skepticism.
It can also diagnose drift in the absence of negativity if 
the data are sufficiently rich. It is called model selection. 

The core principle is that, when tomography fails: 

1. The standard model for tomography - i.i.d. sam- 
ples described by a single density matrix - is bad. 

2. Some other model will be better. 

3. We can quantify "bad" and "better", and use the 
results to decide whether our tomography went 
wrong. 

Clearly, putting this into practice requires that we come 
up with alternative models to describe the data. Model 
design is more of an art than a science. Here, we demon- 
strate alternative models for some simple and relevant 
problems, and leave the rich problems of general and op- 
timal alternative-model design to future work. Instead, 
we focus on model selection, which means determining 
whether (i) the standard tomographic model is pretty 
good, or (ii) some other model (e.g. a drifting source 
model) is better. 

C. Akaike to the rescue 

To accomplish this, we propose, as we mentioned
above, to use the Akaike Information Criterion (AIC)
[3]. Widely used outside of physics [3, 5], the AIC is
relatively unknown within the physics community. However,
it has been applied in astrophysics [8], entanglement
verification [10], and quantum state estimation [11-13].
Its function is to quantify (by assigning a real number) 
how well a given model describes the data from a given 
experiment. The AIC's absolute value is not meaningful, 
but the relative AIC values for multiple different models 
have a deep and useful meaning (see following section for 
a more detailed discussion of the AIC, its meaning, and 
its derivation). Their simplest use is to rank all the dif- 
ferent models, and thus to identify (a) which is the best, 
and (b) how significantly "worse" the others are. 

The AIC assigns a number $\Omega_k$ to each model $k$, given
by [11]

$$\Omega_k := \ln \mathcal{L}_k - K_k, \qquad (2)$$

where $\mathcal{L}_k$ is the likelihood of model $k$ - or, if model $k$ has
adjustable parameters (as is usually the case), the maximum
of the likelihood over all those parameters - and
$K_k$ is the number of independent model parameters used
in model $k$ to fit the data [22]. The larger the AIC ($\Omega_k$)
is, the higher the model is ranked. While $\Omega_k$'s absolute
value is meaningless, the difference $\Delta = \Omega_k - \Omega_{k'}$ represents
(roughly speaking) the weight of evidence in favor
of $k$ over $k'$, measured in bits. So, for example, if we
want to report a weighted average of the two models, the
ratio of the weights assigned to models $k$ and $k'$ should
be $w_k/w_{k'} = \exp(\Omega_k - \Omega_{k'})$.

The AIC's simple form admits a simple interpretation:
fitting the data better (higher likelihood) is good, but
extra parameters are bad. Each additional parameter must
justify its existence by improving the likelihood (a measure
of goodness-of-fit) by at least a factor of $e$. This
helps to prevent overfitting. Adding adjustable parameters
will always improve a model's fit - but a good fit
to past data is not a guarantee that the model will accurately
predict future measurements. Example: If we
measure each of $3N$ qubits, measuring $O_j$ on qubit $j$
for $j = 1 \ldots 3N$, then the best possible fit to the data is
to assume that each qubit $j$ just happened to be in the
appropriate eigenstate of $O_j$, so that the probability of
the observed data is $\mathcal{L} = 1$! Intuitively, this "explanation"
is absurd. The AIC quantifies that intuition; that
model requires a huge ($O(N)$) number of parameters, and
the resulting penalty will overwhelm its higher likelihood,
ensuring that its AIC is far worse than that of simpler
models.

To apply the AIC to our example, we need an alternative
model (the "standard model" just uses a single density
matrix for all $3N$ qubits). A simple alternative that
describes experimental drift (as well as some other forms
of sample-apparatus correlation) is to use one density matrix
for each of the 3 groups of samples. This alternative
model will always fit the data at least as well, but it may
use more parameters [23]. The AIC ranks both models,
and quantifies how much better one is than the other.
We perform and analyze this calculation for our single-qubit
example (where just two models are sufficient) in
Section II A and address more complicated variations on
this theme - with multiple alternative models - in the
remainder of Section II.
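To make the ranking concrete, here is a minimal sketch of the AIC in the form $\ln\mathcal{L} - K$ used in this paper, together with the relative weights $\exp(\Omega_k - \Omega_{k'})$. The data set is hypothetical (outcomes that look like fair coin flips), and counting one parameter per qubit for the "every qubit was in the observed eigenstate" model is an assumption made for illustration; the function names are likewise hypothetical.

```python
import math

def aic_score(max_log_likelihood, n_params):
    """Omega = ln(L_max) - K, the form of the AIC used in this paper."""
    return max_log_likelihood - n_params

def akaike_weights(scores):
    """Normalized weights whose ratios are exp(Omega_k - Omega_k')."""
    top = max(scores)  # subtract the maximum first to avoid overflow in exp
    unnormalized = [math.exp(s - top) for s in scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical data: 3N qubits whose outcomes look like fair coin flips.
N = 100
# Simple model: one density matrix, 3 parameters, every outcome has probability 1/2.
omega_simple = aic_score(3 * N * math.log(0.5), n_params=3)
# Overfitted model: L = 1 (ln L = 0), but (by assumption) one parameter per qubit.
omega_overfit = aic_score(0.0, n_params=3 * N)

print(omega_simple, omega_overfit)
print(akaike_weights([omega_simple, omega_overfit]))
```

Even though the overfitted model has the largest possible likelihood, its parameter penalty leaves it with the lower score, so essentially all of the weight goes to the simple model.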

To conclude this (long) Introduction, we note that
the appearance of maximum likelihoods in the AIC
does not imply any privileged role for maximum likelihood estimation of
states or any other physical quantities. The likelihood is 
a central concept in statistics, and appears in almost ev- 
ery method. In the AIC, it is used specifically to quantify 
goodness-of-fit, and (obviously) the AIC balances this 
quantity against another (model complexity). Moreover, 
the AIC is used only to rank different models. There is 
no implicit requirement that the highest-ranked model 
must be chosen exclusively (in fact, a common strategy 
is to average over high-ranked models), and even if the 
"best" model is chosen, we remain free to analyze that 
model without MLE (e.g., via Bayesian averaging). 



II. EXAMPLES 

In this Section we first treat the example from the Introduction,
tomography on single qubits, in more detail
(Sec. II A). In this example, inconsistencies can arise only
when the observed average values of $\sigma_x, \sigma_y, \sigma_z$ are inconsistent
with each other, which in turn can only happen
if the density matrix obtained by linear inversion is unphysical.
The next example, discussed in Sec. II B, also concerns
single qubits, but now measurements of $\sigma_x, \sigma_y, \sigma_z$
are each repeated once. In this (experimentally more relevant)
case inconsistencies can arise when two estimates
of the same quantity are statistically different. Ad-hoc
methods that just consider this particularly simple type
of inconsistency work just as well as the AIC. In the
last subsection, Sec. II C, we will consider the case of multiple
qubits, in which the validity of ad-hoc methods is
much harder to verify, but the AIC still works in the same
manner, thus showing the universality of that method.



A. One qubit, part 1 

We return to tomography of single qubits, where we
measure $\sigma_x$ on the first $N$ qubits, then $\sigma_y$ on the next
$N$, and $\sigma_z$ on the last $N$ qubits. Denote the three
observed averages by $X := \langle\sigma_x\rangle_{\rm obs}$, $Y := \langle\sigma_y\rangle_{\rm obs}$, and
$Z := \langle\sigma_z\rangle_{\rm obs}$. In order to calculate likelihoods, we need
the frequencies of having observed spin up ($+$) and down
($-$), respectively. They are given in terms of these averages by

$$f_\pm^{(x)} = \frac{1 \pm X}{2}, \qquad (3a)$$
$$f_\pm^{(y)} = \frac{1 \pm Y}{2}, \qquad (3b)$$
$$f_\pm^{(z)} = \frac{1 \pm Z}{2}. \qquad (3c)$$

A density matrix describing just the first set of $N$ measurements
really uses or needs only one parameter, $X$
(the other two parameters are, obviously, not at all determined
by those data). And no matter what $X$ is,
there is always a perfect fit to the data. The logarithm
of the (maximum) likelihood of such a density matrix is,
therefore,

$$\ln \mathcal{L}_1 = -N H\!\left(\frac{1+X}{2}\right), \qquad (4)$$

with $H(\cdot)$ the Shannon entropy (evaluated with natural logarithms,
to match $\ln\mathcal{L}$). The same story holds
for the next two sets of measurements, and so there is
always a perfect fit to the data when we use the "alternative
model" with three density matrices, and that model
needs three independent parameters. We conclude that
the AIC assigns the following ranking to the alternative
model:

$$\Omega_a = -N\left\{ H\!\left(\frac{1+X}{2}\right) + H\!\left(\frac{1+Y}{2}\right) + H\!\left(\frac{1+Z}{2}\right) \right\} - 3. \qquad (5)$$
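The ranking of the alternative model can be evaluated directly from the three observed averages. The following is a minimal sketch of that expression, with the entropy taken in natural-log units so that it matches $\ln\mathcal{L}$; the function names are hypothetical.

```python
import math

def binary_entropy(f):
    """Shannon entropy, in nats, of the two-outcome distribution (f, 1 - f)."""
    if f <= 0.0 or f >= 1.0:
        return 0.0  # a deterministic outcome carries no entropy
    return -f * math.log(f) - (1 - f) * math.log(1 - f)

def omega_alternative(x, y, z, n):
    """AIC score of the three-density-matrix model: -N * (sum of entropies) - 3."""
    return -n * sum(binary_entropy((1 + m) / 2) for m in (x, y, z)) - 3

print(omega_alternative(1.0, 1.0, 1.0, 100))  # perfectly predictable outcomes
print(omega_alternative(0.0, 0.0, 0.0, 100))  # outcomes that look like coin flips
```

For perfectly predictable outcomes only the parameter penalty remains, while coin-flip-like data cost $\ln 2$ per measurement on top of it.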

The performance of the "standard model" depends on
the value of just one number. If

$$R^2 := X^2 + Y^2 + Z^2 < 1, \qquad (6)$$

there is a single maximum likelihood density matrix $\rho$
(with purity ${\rm Tr}\,\rho^2 = (R^2 + 1)/2$) that describes the whole
measurement perfectly, just as the alternative model
does. The standard model also needs three parameters
in this case, and the maximum likelihood is also the same
as for the alternative model. So, in this case there is no
real difference between the two models - we could pick
$\rho_1 = \rho_2 = \rho_3 = \rho$ - and we have $\Omega_s = \Omega_a$. There is no
reason to reject the standard model when $R < 1$.

Now let us suppose that $R > 1$. We have then the
choice between two descriptions:

1. Alternative model: We describe each of the three
measurements by its own density matrix. The
maximum likelihood estimates of those three states
satisfy

$${\rm Tr}\,\rho_1 \sigma_x = X, \qquad (7a)$$
$${\rm Tr}\,\rho_2 \sigma_y = Y, \qquad (7b)$$
$${\rm Tr}\,\rho_3 \sigma_z = Z. \qquad (7c)$$

Three independent parameters are needed for this
model. ($\rho_1, \rho_2, \rho_3$ are underdetermined, of course,
but for the purpose of finding the maximum likelihood
$\mathcal{L}_a$ the information suffices.)

2. Standard model: We use one density matrix to describe
all three measurements together. The maximum
likelihood estimate of that state will be pure.
There is no known method to compute it exactly,
but a generally good approximation is given by

$${\rm Tr}\,\rho_s \sigma_x = X/R, \qquad (8a)$$
$${\rm Tr}\,\rho_s \sigma_y = Y/R, \qquad (8b)$$
$${\rm Tr}\,\rho_s \sigma_z = Z/R, \qquad (8c)$$

and this state's likelihood is a strict (but generally
pretty tight) lower bound on the maximum likelihood
for the standard model. Two independent
parameters are needed in this model [24].

The reason we end up with a pure maximum likelihood
state in the standard model is that the single matrix
fitting the data perfectly lies outside the set of physical
states (it has a negative eigenvalue), and the closest
physical state lies on the boundary [5]. In the case
of qubits, this means a pure state. More precisely, if
the unphysical best-fit matrix $\rho$ is written in its diagonal
form, $\rho = \sum_{k=\pm} \lambda_k |\psi_k\rangle\langle\psi_k|$, with $\lambda_+ > 1$ and
$\lambda_- < 0$, then the maximum likelihood estimate would be
$\rho_s = |\psi_+\rangle\langle\psi_+|$. The latter state has the properties (8),
as can be easily verified by explicit calculation.

Thus, when $R > 1$ the alternative model fits the data
better but uses one more parameter than does the standard
model. We can calculate the maximum likelihoods
analytically in each of the two models, and thus obtain
the relative AIC score of the two models:

$$\Omega_s - \Omega_a = 1 + N \sum_{M=X,Y,Z} \left[ \frac{1}{2} \ln\frac{1 - M^2/R^2}{1 - M^2} + \frac{M}{2} \ln\frac{(R+M)(1-M)}{(R-M)(1+M)} \right]. \qquad (9)$$




We accept the standard model as consistent iff $\Omega_s > \Omega_a$.
This will happen only if $R$ is sufficiently close to 1. If we
expand $R$ around 1, we can Taylor expand the right-hand
side of (9) as

$$1 - N \sum_{M=X,Y,Z} \frac{(R-1)^2 M^2}{2(1 - M^2)}, \qquad (10)$$

provided $(R-1)^2 \ll (1 - M^2)$ for $M = X, Y, Z$. That is,
with this proviso, the standard model is consistent only
when

$$(R - 1) < \frac{C}{\sqrt{N}}, \qquad (11)$$

with the constant $C$ given by

$$C = \frac{1}{\sqrt{\sum_M M^2 / 2(1 - M^2)}}. \qquad (12)$$

The dependence of the condition (11) on $N$ agrees with
the simple idea that it is sufficient for $R$ to be less than
about a standard deviation or two above 1 for the standard
model to still apply, and that standard deviation,
of course, decays like $1/\sqrt{N}$ for $N \to \infty$.
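The comparison in this subsection can be sketched numerically. The code below assumes, as in the text, that for $R > 1$ the standard model is represented by the pure state with Bloch vector $(X, Y, Z)/R$ and two parameters, and the alternative model by the observed frequencies and three parameters. It evaluates the AIC difference directly from the log-likelihoods and compares it with the second-order expansion in $(R-1)$. All function names and the example values of $X$, $Y$, $Z$, $N$ are hypothetical.

```python
import math

def log_likelihood(observed_means, model_means, n):
    """ln-likelihood of n outcomes per axis, given observed and model mean values."""
    total = 0.0
    for m, q in zip(observed_means, model_means):
        for sign in (+1, -1):
            freq = (1 + sign * m) / 2  # observed frequency of this outcome
            prob = (1 + sign * q) / 2  # probability the model assigns to it
            if freq > 0:
                total += n * freq * math.log(prob)
    return total

def aic_difference(x, y, z, n):
    """Omega_s - Omega_a; zero when R <= 1, where both models fit equally well."""
    obs = (x, y, z)
    r = math.sqrt(sum(m * m for m in obs))
    if r <= 1:
        return 0.0
    std = tuple(m / r for m in obs)  # pure-state approximation to the MLE
    return (log_likelihood(obs, std, n) - 2) - (log_likelihood(obs, obs, n) - 3)

def aic_difference_taylor(x, y, z, n):
    """Second-order expansion of Omega_s - Omega_a around R = 1."""
    r = math.sqrt(x * x + y * y + z * z)
    return 1 - n * sum((r - 1) ** 2 * m * m / (2 * (1 - m * m)) for m in (x, y, z))

def threshold_constant(x, y, z):
    """The constant C in the condition (R - 1) < C / sqrt(N)."""
    return 1 / math.sqrt(sum(m * m / (2 * (1 - m * m)) for m in (x, y, z)))

# Hypothetical averages slightly outside the Bloch sphere (R is about 1.0046).
x = y = z = 0.58
n = 10000
print(aic_difference(x, y, z, n), aic_difference_taylor(x, y, z, n))
print(threshold_constant(x, y, z) / math.sqrt(n))  # largest tolerable R - 1
```

For averages this close to the surface of the Bloch sphere the expansion tracks the direct calculation closely; for the extreme data $X = Y = Z = 1$ only the direct calculation applies, and it rejects the standard model decisively.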



B. One qubit, part 2 

The implementation of tomography in the previous example
is probably too simple and too obviously wrong
for it to have been applied in an actual experiment.
The straightforward improvement of measuring each of
$\sigma_x, \sigma_y, \sigma_z$ in two separate blocks will allow one to detect
drift. Let us denote the 6 observed averages by
$X_{1,2} := \langle\sigma_x\rangle_{{\rm obs}\,1,2}$, $Y_{1,2} := \langle\sigma_y\rangle_{{\rm obs}\,1,2}$, and
$Z_{1,2} := \langle\sigma_z\rangle_{{\rm obs}\,1,2}$. Drift
can be detected by comparing the pair of estimates $X_{1,2}$
with each other, $Y_{1,2}$ with each other, and $Z_{1,2}$ with each
other. The AIC works as follows: We need again at least
two different models for describing the data. One will be
the standard model, with one density matrix describing
all 6 measurements. This density matrix will be determined
by the three averages $(X_1 + X_2)/2$ etc. The alternative
model may consist of two independent density
matrices (with 6 parameters in total) or of two density
matrices that are not independent, with either 4 or 5 parameters
in total. Let us test the AIC in a simulation of
data generated by single-qubit states of the form

$$\rho_{\rm actual} = p\,|\psi_\phi\rangle\langle\psi_\phi| + (1 - p)\,\mathbb{1}/2, \qquad (13)$$

where the pure state $|\psi_\phi\rangle$ depends on an angle $\phi$, which
we assume to undergo a random walk, and with the following
meaning:

$$\langle\psi_\phi|\sigma_x|\psi_\phi\rangle = \cos\phi;$$
$$\langle\psi_\phi|\sigma_y|\psi_\phi\rangle = \sin\phi;$$
$$\langle\psi_\phi|\sigma_z|\psi_\phi\rangle = 0. \qquad (14)$$

For $p$ we take the value $p = 0.9$. We perform in total 3000
measurements, divided into 6 groups of 500, in which we
measure $\sigma_x, \sigma_y, \sigma_z, \sigma_x, \sigma_y, \sigma_z$ in that order.

In Figs 1 and 2 we plot two qualitatively different cases. 
In the first case the diffusion of $\phi$ is so fast that it leads to
noticeably different values of $X_{1,2}$ and $Y_{1,2}$. The AIC in
this case gives a very clear preference for the alternative 
model of using two density matrices with 5 parameters in 
total (only the expectation value of a z does not change 
over the course of the experiment). 

In the second case the drift over the course of the ex- 
periment is small enough so that the standard model is 
still the best, even though there is some drift, and even 
though the more complicated model does, of course, fit 
the data slightly better. 
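A simulation of this kind can be sketched in a few lines. The code below is an illustration under stated assumptions, not the simulation used for the figures: the angle starts at zero and takes Gaussian random-walk steps (the step size is a free choice), and all pooled and per-block frequencies are assumed to correspond to physical states, so that the observed frequencies are themselves the maximum likelihood estimates. The function names are hypothetical.

```python
import math
import random

def simulate_counts(step, p=0.9, n=500, seed=0):
    """Spin-up counts for blocks measured in the order x, y, z, x, y, z."""
    rng = random.Random(seed)
    phi = 0.0  # assumed starting angle
    ups = {"x": [], "y": [], "z": []}
    for axis in ("x", "y", "z", "x", "y", "z"):
        up = 0
        for _ in range(n):
            phi += rng.gauss(0.0, step)  # random-walk drift of the angle
            mean = {"x": p * math.cos(phi), "y": p * math.sin(phi), "z": 0.0}[axis]
            if rng.random() < (1 + mean) / 2:
                up += 1
        ups[axis].append(up)
    return {a: tuple(v) for a, v in ups.items()}

def block_log_lik(n_up, n_total, p_up):
    """Binomial log-likelihood without the model-independent combinatorial factor."""
    ll = 0.0
    if n_up > 0:
        ll += n_up * math.log(p_up)
    if n_total - n_up > 0:
        ll += (n_total - n_up) * math.log(1 - p_up)
    return ll

def omega_standard(counts, n):
    """One density matrix: pooled frequency per axis, 3 parameters."""
    ll = 0.0
    for up1, up2 in counts.values():
        pooled = (up1 + up2) / (2 * n)
        ll += block_log_lik(up1, n, pooled) + block_log_lik(up2, n, pooled)
    return ll - 3

def omega_drift(counts, n, drifting=("x", "y")):
    """Two density matrices: axes in `drifting` get one parameter per block."""
    ll = 0.0
    for axis, (up1, up2) in counts.items():
        if axis in drifting:
            ll += block_log_lik(up1, n, up1 / n) + block_log_lik(up2, n, up2 / n)
        else:
            pooled = (up1 + up2) / (2 * n)
            ll += block_log_lik(up1, n, pooled) + block_log_lik(up2, n, pooled)
    return ll - (3 + len(drifting))

counts = simulate_counts(step=0.02)
print(counts)
print(omega_standard(counts, 500) - omega_drift(counts, 500))  # negative: drift detected
```

With the default `drifting=("x", "y")` the alternative has five parameters, as in the fast-diffusion case described above; passing all three axes gives the six-parameter model of two independent density matrices.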
FIG. 1: Top: Simulation of diffusion of the angle $\phi$ in the
state of Eq. (13) over the course of 3000 measurements. Bottom:
the number of "spin up" results for the measurements of $\sigma_x$
(in red, for measurements 1...500 and 1501...2000), of $\sigma_y$ (in
green, for measurements 501...1000 and 2001...2500), and of
$\sigma_z$ (in black, for the remaining measurements). The numbers
for $\sigma_y$ are statistically different, and this is reflected in the
relative ranking the AIC accords to the different models. Here
we have $\Omega_s - \Omega_a = -5.07$, where the negative sign implies
the standard model of a single density matrix is significantly
worse than the alternative model (here, two density matrices
with five parameters in total, two parameters more than the
standard model). Tomography failed in this case.



FIG. 2: Same as Fig. 1, but for a case where the drift is
much smaller over the course of 3000 measurements. Here
$\Omega_s - \Omega_a = 1.38$, so that the standard model is better than
the best alternative model (which has two extra parameters).
Tomography succeeded.



C. Two or more qubits

Consider now a tomographically complete measurement
on $9N$ copies of two qubits, where on the first $N$
pairs of qubits we measure $\sigma_x$ on both qubits independently,
then on the next $N$ pairs we measure $\sigma_x$ on one
qubit and $\sigma_y$ on the other, then on the third set of $N$
pairs we measure $\sigma_x$ on the one and $\sigma_z$ on the other ...
until on the last (9th) set of $N$ pairs we measure $\sigma_z$ on
both qubits. The first measurement is described by three
independent averages that are obtained from measuring
$\sigma_x$ on both qubits independently:

$$XX := \langle\sigma_x \otimes \sigma_x\rangle_{{\rm obs},1}, \qquad (15a)$$
$$IX := \langle\mathbb{1} \otimes \sigma_x\rangle_{{\rm obs},1}, \qquad (15b)$$
$$XI := \langle\sigma_x \otimes \mathbb{1}\rangle_{{\rm obs},1}. \qquad (15c)$$

Thus, a two-qubit density matrix perfectly fitting the
data of the first measurement needs three parameters.
The description of the second measurement of $\sigma_x$ on one
qubit and $\sigma_y$ on the other is likewise determined by three
observed averages

$$XY := \langle\sigma_x \otimes \sigma_y\rangle_{{\rm obs},2}, \qquad (16a)$$
$$IY := \langle\mathbb{1} \otimes \sigma_y\rangle_{{\rm obs},2}, \qquad (16b)$$
$$XI' := \langle\sigma_x \otimes \mathbb{1}\rangle_{{\rm obs},2}. \qquad (16c)$$

The new feature arising here is that we get a second estimate
of the same parameter, $XI$ in this case. That is,
if there were only a single two-qubit state in the experiment,
the estimates (15c) and (16c) would have to agree
(within error bars). Conversely, if they do not agree, we
have encountered a new diagnosis of inconsistent tomography.

Writing down all different averages obtained from this
particular experiment, we find 9 quantities that are measured
once, and 6 other quantities that are measured
thrice. It now becomes much harder to judge whether all
the differences between those different estimates of the
same quantities are, in total, statistically significant or
not. That is, the generalization of the ad-hoc method
that worked fine for a single qubit becomes troublesome.
This, of course, becomes exponentially worse for more
than two qubits.

On the other hand, the AIC can be applied straightforwardly
to various alternative models. It is sufficient
to find just one alternative model superior to the standard
model in order to have succeeded in diagnosing an
inconsistency in our tomographic experiment. Of course
there is a large multitude of alternative models, but one
can be guided in searching for such models by looking for
those estimates of the same quantities that are the least
consistent.



III. MODEL SELECTION, THE AIC, AND
QUANTUM QUIRKS



Data are generally assumed to be generated by some 
stochastic process [35] - e.g., a probability distribution 
f(x) (where x denotes the sample space containing all 
possible events). Unfortunately, these "true" probabili- 
ties are unknown to us. All we have are some data. So, 
in order to (i) describe the data; (ii) approximate the 
underlying process f; and (iii) most importantly, predict
future observations, we use models. 

A model is just another probability distribution g(x). 
Almost always, the model contains a whole family of parameterized
distributions $g_\theta(x)$, where $\theta$ comprises the
values of K distinct [real-valued] parameters. One obvious
model is the universal one where each of the probabilities
g(x) - for every possible value of x - is itself a free
parameter. This is the richest possible model, with the 
most parameters. If x takes on uncountably many values, 
this model is utterly intractable (and the AIC penalizes 
it infinitely for its richness). The ubiquity of this problem 
in statistics motivates the use of restricted parameterized 
models (e.g., Gaussian distributions) where finitely many 
parameters can specify g(x) for every possible x. 

Quantum tomography applications usually involve 
finitely many parameters, but few-parameter models are 
still important. This is partly because of the simplifi- 
cation obtained by eliminating many parameters (e.g., 
when a quantum state in $2^N$ dimensions is approximated
by a matrix product state with poly($N$) parameters), but
even more importantly because it guards against over- 
fitting. This is precisely where well-designed model se- 
lection techniques come in, and the AIC is a canonical 
example. When there is a choice between different candi- 
date models describing one and the same experiment, the 
AIC provides a numerical ranking of the different models. 

The AIC (as given in Eq. (2)) appears very simple.
Moreover, it bears a strong resemblance to quantities
that appear in likelihood-ratio (LR) hypothesis testing
(see, e.g. [2]). But in fact, the AIC's theoretical underpinnings
are rather different, and remarkably elegant
(see [7] for extensive discussion). Likelihood ratios are a
fundamentally frequentist technique: given two competing
models, we calculate ahead of time the probability
that various values of the LR statistic will be observed if
one model or the other is "correct", and then we formulate
a rule for what to announce upon seeing any given
value of the LR statistic. Many canonical results on LR
tests require that the models be nested - i.e., that one
be a subset of the other. In particular, given this and a
few other conditions, it is possible to derive expectation
values of the LR statistic that look identical to Eq. (2),
because the loglikelihood ratio is $\chi^2_K$ distributed, and has
mean value K.

But despite this similarity, the AIC is derived differ- 
ently. Akaike began by postulating that "goodness" of 
a model is quantified by the Kullback-Leibler divergence 
[15] between the model and the "true model" that ac- 
tually underlies the data. Then, rather remarkably, he 
showed that it is possible to estimate this divergence [25] 
- even when the true model is unknown! The AIC is the 
expected value of the [unknown] Kullback-Leibler diver- 
gence between a specified model and the [unknown] true 
model, conditional upon the data in our possession. So 
the AIC (i) has a powerful and universal interpretation, 
and (ii) can be used to compare arbitrary models, with- 
out any requirement for nesting. 

This is not to say that the AIC is the acme of model 
selection, nor that it is perfectly adapted to quantum to- 
mography problems. First, there are competing deriva- 
tions of other model ranking statistics, such as the 
Bayesian Information Criterion (BIC - again, see [7]). 
Moreover, the AIC is inherently an asymptotic result - 
much like, for example, the efficiency of MLE. So, even 
though there is a finite sample size correction (the AIC$_c$),
this correction is part of an asymptotic expansion and 
may be unreliable for any fixed N. 

One significant consequence of this is that, for finite 
samples, an event x whose true probability is nonzero 
may not be observed - in which case a model might assign 
zero probability to it. (The MLE within the full model, 
where each probability g(x) is a parameter, behaves this
way). This results inevitably in an infinite Kullback-Leibler
divergence. Asymptotically, the probability of
such a pathology occurring goes to zero.
But for any finite sample size it is a concern. So, beware 
of rank-deficient estimates in tomography! 
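The pathology described here is easy to reproduce. The sketch below computes the Kullback-Leibler divergence between a true distribution f and a model g for a discrete sample space; the probabilities used are hypothetical. A model that assigns zero probability to an outcome that can actually occur has infinite divergence, whereas a model that keeps a small nonzero probability for it does not.

```python
import math

def kl_divergence(f, g):
    """Kullback-Leibler divergence D(f || g), in nats, for discrete distributions."""
    total = 0.0
    for fx, gx in zip(f, g):
        if fx == 0.0:
            continue         # outcomes that never occur contribute nothing
        if gx == 0.0:
            return math.inf  # the model rules out something that does happen
        total += fx * math.log(fx / gx)
    return total

true_probs = [0.99, 0.01]
# In a small sample the rare outcome may never show up, so the MLE assigns it zero.
mle_from_small_sample = [1.0, 0.0]
# A model that keeps a small nonzero probability for the unseen outcome.
full_rank_model = [0.98, 0.02]

print(kl_divergence(true_probs, mle_from_small_sample))
print(kl_divergence(true_probs, full_rank_model))
```

In tomography the analogue of the zero-probability model is a rank-deficient density matrix, which predicts that some measurement outcome can never occur.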

A related phenomenon is [almost] unique to quantum 
tomography. Akaike's derivation assumes that a very 
good (if not the best) measure of predictive power is 
the Kullback-Leibler divergence between the true model 
f(x) for the observed process x and the assigned model 
g(x). But in quantum tomography, the observed process
"x" is some particular (and rather arbitrary) quorum of 
measurements that the tomographer has performed. We 
don't necessarily care about predicting those measure- 
ments! Instead, we care about the underlying quantum 
state - or, to put it more operationally, we care about 
a large and unknown set of other measurements that 
might be performed on samples of that state in the fu- 
ture. Quite frequently, we care about measurements of 
that state's diagonal basis. This completely undermines 
Akaike's assumption (that predicting x is the goal). This
does not mean that the AIC should not be used - but it
does strongly suggest that: 

1. Conclusions drawn from the AIC, or any other 
classical statistical method, should be treated with 
thoughtful care, 

2. Better methods may still be derived (e.g., a "quantum
AIC"),

3. Estimates obtained via the AIC should not be ex- 
pected to have good properties with respect to 
quantum relative entropy (the quantum version of 
Kullback-Leibler divergence).

Importantly, however, there are cases where our future 
measurements will be the same as those used for our pre- 
liminary quantum tomography experiment. For instance, 
in the case of quantum computing, where error correction 
is implemented by CSS codes, all measurements will be 
Pauli measurements. In such a case, the conclusions of 
the AIC, applied to a tomography experiment that used 
Pauli measurements as well, should be trustworthy. 

IV. SUMMARY AND DISCUSSION 

Our central message here is that when the assump- 
tions of tomography fail, it is often due to some sort of 
sample-apparatus correlation, and that this can be de- 
tected with statistical reliability by model selection us- 
ing the AIC. One particular example, the drifting source, 
clearly voids the single-density-matrix model, but can be 
described naturally (and more accurately!) by multiple 
density matrices associated with different times and/or 
measurement settings. The AIC is a particularly good 
and elegant tool for identifying whether the added com- 
plexity of this model is justified. Ultimately, the point 
of model selection (especially using the AIC) is to get 
better predictions of future measurement outcomes - not 
just better fits to observed data. 

While the AIC ranks competing models, by assigning 
each model k a number through Eq. ([2|, we have 
great flexibility in what to do with that ranking. Small 
differences in AIC are not significant; if $|\Omega_k - \Omega_{k'}| \ll 1$,
then both models are equally good. But even when sig- 
nificant differences exist, we may choose to use the "best" 
model exclusively, or to hedge by mixing it with lower- 
ranked models (with weights determined by their respec- 
tive AICs). We could apply Bayesian methods to the 
highest-ranked model, or use maximum likelihood esti- 
mates to choose model parameters. Choosing between 
these alternatives is beyond the scope of this paper. 

If a model-selection (e.g., AIC) analysis finds over- 
whelming evidence of sample-apparatus correlation (e.g. 
source drift), it is often possible to go beyond the con- 
clusion "tomography has failed!" What has really failed 
is the i.i.d. assumption - we have convincing evidence 
that the samples are not identically distributed. The 
joint state is therefore not (with high confidence) of De
Finetti form (see [H>]). But it may be possible to assign
states with a relaxed De Finetti form, and thereafter to 
do tomography with this in mind. For example, if the 
AIC declares the alternative three-state model much su- 
perior to the single-state model, one could assign a state 
of the form 



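$$\rho^{(3N)} = \int d\rho_1 \int d\rho_2 \int d\rho_3 \, P_a(\rho_1,\rho_2,\rho_3)\, \rho_1^{\otimes N} \otimes \rho_2^{\otimes N} \otimes \rho_3^{\otimes N} \qquad (17)$$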

to the 3N qubits, where $P_a(\cdot,\cdot,\cdot)$ is a joint probability
distribution over three 2D density matrices. This form 
itself needs to be tested and validated, by comparison to 
a richer model (e.g., a model with 6, 9, or more different
states). In general, validating a model requires more
sophisticated model design - e.g., to describe more arbi- 
trary forms of source drift - and perhaps different mea- 
surements or experiments specifically aimed at detecting 
those models, as proposed in [17]. But once a given model
is validated, if it implies a relaxed De Finetti form as in 
Eq. (17), then we can in principle perform tomography
independently on each of the i.i.d. subsets of the whole
sample. 

In the simplest case of tomography on single qubits, 
we discussed two competing models. Either one uses just 
a single density matrix $\rho$ to describe the experiment [the
standard model], or one uses three ($\rho_1$, $\rho_2$, $\rho_3$), one for each
set of N qubits used to measure $\sigma_x$, $\sigma_y$, and $\sigma_z$, respectively
[the alternative model]. But what does it mean to
use three density matrices for predicting future measure- 
ment outcomes? The answer is that the predictions refer 
to measurements on qubits that have not been measured 
yet (of course). Consider one unmeasured qubit taken 
from, say, a set of N + n qubits, from which N qubits
were randomly picked to be measured in the $\sigma_x$ basis and
n were not measured. In this case, those n qubits would 
be assigned a state of the form 



$$\rho^{(n)} = \int d\rho_1 \int d\rho_2 \int d\rho_3 \, P_a(\rho_1,\rho_2,\rho_3)\, \rho_1^{\otimes n}, \qquad (18)$$



valid for any n, including n = 1. The mixed model, as
mentioned above, would combine the standard and alter- 
native models and assign an even more mixed state. For 
example, in the case n = 1 it would assign the estimate 

$$\rho_{\rm mixed} = w_a \int d\rho_1 \int d\rho_2 \int d\rho_3 \, P_a(\rho_1,\rho_2,\rho_3)\, \rho_1 + w_s \int d\rho \, P_s(\rho)\, \rho, \qquad (19)$$

with $w_a = \exp(\Omega_a)/(\exp(\Omega_a)+\exp(\Omega_s))$ and $w_s = 1 - w_a$
the relative weights of the two models, as assigned by the
AIC, and with $P_s(\cdot)$ the standard De Finetti probability
distribution over single density matrices.
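As an illustration of Eq. (19), the following sketch computes the two weights and the resulting mixed single-qubit estimate. Because Eq. (19) is linear in the density matrices, each integral reduces to the mean state under the corresponding distribution; the sketch therefore takes those two mean states as inputs. The helper names, the Bloch vectors, and the values of $\Omega_a$ and $\Omega_s$ are hypothetical.

```python
import math
import numpy as np

def akaike_weights(omega_a: float, omega_s: float):
    """w_a = exp(O_a) / (exp(O_a) + exp(O_s)) and w_s = 1 - w_a."""
    m = max(omega_a, omega_s)   # subtract the maximum to avoid overflow
    ea, es = math.exp(omega_a - m), math.exp(omega_s - m)
    w_a = ea / (ea + es)
    return w_a, 1.0 - w_a

def bloch_state(x: float, y: float, z: float) -> np.ndarray:
    """Single-qubit density matrix with Bloch vector (x, y, z)."""
    sx = np.array([[0, 1], [1, 0]], dtype=complex)
    sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
    sz = np.array([[1, 0], [0, -1]], dtype=complex)
    return 0.5 * (np.eye(2, dtype=complex) + x * sx + y * sy + z * sz)

def mixed_estimate(omega_a, omega_s, mean_rho1_alt, mean_rho_std):
    """n = 1 mixed estimate: w_a * <rho_1>_{P_a} + w_s * <rho>_{P_s}."""
    w_a, w_s = akaike_weights(omega_a, omega_s)
    return w_a * mean_rho1_alt + w_s * mean_rho_std

# Hypothetical mean states and model scores.
rho_alt = bloch_state(0.6, 0.0, 0.0)   # mean of rho_1 under P_a
rho_std = bloch_state(0.3, 0.1, 0.4)   # mean of rho under P_s
rho_mixed = mixed_estimate(-1203.0, -1210.0, rho_alt, rho_std)
print(akaike_weights(-1203.0, -1210.0))
print(np.round(rho_mixed, 4))
```

Only the difference $\Omega_a - \Omega_s$ enters the weights, so subtracting the larger value before exponentiating changes nothing except numerical stability.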

Although we have avoided discussion of model design 
here, one simple but powerful technique deserves men- 
tion. In the example at the beginning of the paper, we 
introduced an alternative model wherein each measure- 
ment setting is associated with a different density matrix. 
When the measurements are informationally complete, 
this alternative model has precisely as many parameters 
as the standard model. But if they are overcomplete, then 
the alternative model has more parameters. As long as 
the samples really are i.i.d., we expect the alternative 
model to fit slightly better, and the AIC to declare them 
(on average) equally good. However, in the presence of 
experimental drift, we will find inconsistencies within the 
overcomplete measurement set - i.e., we will not be able 
to fit all the measurements well with a single density ma- 
trix! This is a simple test for experimental drift that does 
not rely on negativity of $\rho_{\rm tomo}$.
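The overcomplete-measurement test can be sketched for a single qubit as follows. Everything specific here is an assumption made for illustration: the four measurement axes (x, y, z, and one diagonal direction), the shot counts, the damped Newton optimizer, and the textbook convention AIC = 2K - 2 ln L. The standard model has three parameters (one Bloch vector); the alternative model is counted as having one free outcome probability per setting, since only that component of each density matrix affects the likelihood. The optimizer assumes the maximum lies inside the Bloch ball.

```python
import numpy as np

_d = 1.0 / np.sqrt(3.0)
# Four measurement axes: overcomplete for one qubit (illustrative choice).
AXES = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [_d, _d, _d]])

def log_lik(p, n_plus, n_shots):
    """Binomial log-likelihood (constant binomial coefficients dropped)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(np.sum(n_plus * np.log(p) + (n_shots - n_plus) * np.log(1 - p)))

def fit_single_state(n_plus, n_shots, iters=100):
    """MLE of one Bloch vector r by damped Newton ascent inside the unit ball."""
    r = np.zeros(3)
    for _ in range(iters):
        p = (1 + AXES @ r) / 2
        grad = AXES.T @ (n_plus / p - (n_shots - n_plus) / (1 - p)) / 2
        curv = (n_plus / p**2 + (n_shots - n_plus) / (1 - p)**2) / 4
        hess = -(AXES.T * curv) @ AXES          # negative definite
        step = np.linalg.solve(hess, -grad)     # ascent direction
        base, t, accepted = log_lik(p, n_plus, n_shots), 1.0, False
        while t > 1e-10:                        # backtrack until valid and no worse
            r_new = r + t * step
            if np.linalg.norm(r_new) < 1 and \
               log_lik((1 + AXES @ r_new) / 2, n_plus, n_shots) >= base:
                accepted = True
                break
            t /= 2
        if not accepted or np.linalg.norm(r_new - r) < 1e-12:
            break
        r = r_new
    return r

def compare_models(n_plus, n_shots):
    """Return (AIC of single-state model, AIC of one-state-per-setting model)."""
    n_plus = np.asarray(n_plus, dtype=float)
    r_hat = fit_single_state(n_plus, n_shots)
    ll_std = log_lik((1 + AXES @ r_hat) / 2, n_plus, n_shots)
    ll_alt = log_lik(n_plus / n_shots, n_plus, n_shots)   # free probability per axis
    return 2 * 3 - 2 * ll_std, 2 * len(AXES) - 2 * ll_alt

# Hypothetical "+1" counts out of 1000 shots per axis.
consistent = [650, 600, 750, 789]   # compatible with r = (0.3, 0.2, 0.5)
drifted = [650, 600, 750, 600]      # fourth setting disagrees with the first three
print(compare_models(consistent, 1000))
print(compare_models(drifted, 1000))
```

For the consistent counts the single-state model fits almost as well with one fewer parameter, so it has the lower AIC. For the inconsistent counts no single Bloch vector reproduces all four settings, and the alternative model is strongly preferred.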

For the main point of this paper, however, all these 
complications are unnecessary. All that matters is 
whether assigning a single density matrix to our tomog- 
raphy experiment constitutes the best model or not. If 
not, something is amiss, but at least we have diagnosed 
the problem. 

The main issue we left open is the following: is there 
a sense in which the AIC works reliably if future mea- 
surements are different than those used in our tomogra- 
phy experiment? If not, is there a "quantum" version 
of the AIC that, e.g., takes into account the quorum of 
observables that have been measured, as well as the set 
of observables that will be measured? 

(Upon completion of this paper Ref. [TB] appeared, 
which is similar in spirit to our paper, but which uses 
$\chi^2$ tests to detect errors in tomography. It points out,
too, the problem with pure-state assignments for those 
tests.) 

(After submission of the page proofs we became aware 
of two more relevant papers: [H?I |2"D"] .) 